Error-Tagged Learner Corpus of Czech
Abstract
The paper describes a learner corpus of Czech, currently under development. The corpus captures Czech as used by non-native speakers. We discuss its structure, the layered annotation of errors, and the annotation process.
Related resources
CzeSL – an error tagged corpus of Czech as a second language
Using an error-annotated learner corpus as the basis, the goal of this paper is two-fold: (i) to evaluate the practicality of the annotation scheme by computing inter-annotator agreement on a non-trivial sample of data, and (ii) to find out whether the application of automated linguistic annotation tools (taggers, spell checkers and grammar checkers) on the learner text is viable as a substitut...
Creating a manually error-tagged and shallow-parsed learner corpus
The availability of learner corpora, especially those which have been manually error-tagged or shallow-parsed, is still limited. This means that researchers do not have a common development and test set for natural language processing of learner English such as for grammatical error detection. Given this background, we created a novel learner corpus that was manually error-tagged and shallowpar...
Improvements to Korektor: A Case Study with Native and Non-Native Czech
We present recent developments of Korektor, a statistical spell checking system. In addition to lexicon, Korektor uses language models to find real-word errors, detectable only in context. The models and error probabilities, learned from error corpora, are also used to suggest the most likely corrections. Korektor was originally trained on a small error corpus and used language models extracted...
Building and Using Corpora of Non-Native Czech
Investigating language acquisition by non-native learners helps to understand important linguistic issues and develop teaching methods, better suited both to the specific target language and to the learner. These tasks can now be based on empirical evidence from learner corpora. A learner corpus consists of language produced by language learners, typically learners of a second or foreign langua...
Annotating foreign learners’ Czech
One of the challenges of contemporary corpus linguistics is the compilation and annotation of corpora consisting of texts produced by non-native speakers. In addition to morphosyntactic tagging and lemmatisation, such texts can be annotated by information relevant to the specific nonstandard use. Cases of deviant language use can be corrected and identified by a tag specifying the type of the e...